A. About Bellabeat

Bellabeat manufactures high-tech, health-focused smart products. Its co-founder, the artist Urška Sršen, has helped develop beautifully designed technology that empowers and inspires women all over the world. Bellabeat collects data on women’s activity, sleep, stress, and reproductive health. Founded in 2013, Bellabeat has grown quickly to position itself as a wellness tech company for women around the world.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products are available on Bellabeat.com and other online retail shops.

Business Questions

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Business Task

Identify market opportunities for growth and provide high-level recommendations to guide Bellabeat’s marketing strategy, based on trends in smart device usage.

B. Preparing and Cleaning Data

The FitBit Fitness Tracker Dataset was recommended by Bellabeat’s Co-founder and Chief Creative Officer, Urška Sršen.

Content

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Acknowledgements

The FitBit Fitness Tracker Dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit

Credits: Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa

B2. Exploratory Data Analysis

B.2.1 Daily Activity Merged DF

I chose R for this analysis because an R notebook is easy to share with colleagues and easy to reproduce from a GitHub repository.

First, I import three files from the dataset:

  * dailyActivity_merged
  * hourlyCalories_merged
  * sleepDay_merged

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)

daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_activity) 
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

Summarizing the data and checking for missing values

summary(daily_activity)
##        Id            ActivityDate         TotalSteps    TotalDistance   
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  TrackerDistance  LoggedActivitiesDistance VeryActiveDistance
##  Min.   : 0.000   Min.   :0.0000           Min.   : 0.000    
##  1st Qu.: 2.620   1st Qu.:0.0000           1st Qu.: 0.000    
##  Median : 5.245   Median :0.0000           Median : 0.210    
##  Mean   : 5.475   Mean   :0.1082           Mean   : 1.503    
##  3rd Qu.: 7.710   3rd Qu.:0.0000           3rd Qu.: 2.053    
##  Max.   :28.030   Max.   :4.9421           Max.   :21.920    
##  ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
##  Min.   :0.0000           Min.   : 0.000      Min.   :0.000000       
##  1st Qu.:0.0000           1st Qu.: 1.945      1st Qu.:0.000000       
##  Median :0.2400           Median : 3.365      Median :0.000000       
##  Mean   :0.5675           Mean   : 3.341      Mean   :0.001606       
##  3rd Qu.:0.8000           3rd Qu.: 4.782      3rd Qu.:0.000000       
##  Max.   :6.4800           Max.   :10.710      Max.   :0.110000       
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
dplyr::glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
skimr::skim(daily_activity)
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09 ▇▅▃▅▅
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04 ▇▇▁▁▁
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00 ▇▁▁▁▁
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01 ▇▁▁▁▁
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00 ▇▁▁▁▁
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01 ▆▇▆▁▁
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01 ▇▁▁▁▁
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02 ▇▁▁▁▁
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02 ▇▁▁▁▁
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02 ▅▇▇▃▁
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03 ▁▁▇▅▇
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03 ▁▆▇▃▁

A quick inference from the summaries above:

  * The dailyActivity_merged DF has 940 observations and 15 variables: 1 character column and 14 numeric columns.
  * The character column, ActivityDate, holds dates and should be converted to a date type.
  * There are no missing values in the dailyActivity_merged DF.
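A per-column count of missing values and a count of distinct participants are two quick ways to back up this inference. The following is a minimal base R sketch; `sample_activity` is a hypothetical stand-in built from the first three rows printed above, not the full file.

```r
# Hypothetical stand-in: first three rows of daily_activity as printed above.
sample_activity <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460),
  stringsAsFactors = FALSE
)

# Number of missing values in each column (all zeros means no NAs).
na_per_column <- colSums(is.na(sample_activity))

# Number of distinct participants and of fully duplicated rows.
n_participants <- length(unique(sample_activity$Id))
n_duplicate_rows <- sum(duplicated(sample_activity))

print(na_per_column)
print(n_participants)
print(n_duplicate_rows)
```

Run against the full daily_activity DF, `n_participants` would show whether the file really contains the thirty users mentioned in the dataset description.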

Convert character column to date type

library(dplyr)
library(tidyr)
library(stringr)

daily_activity <- daily_activity %>%
  mutate(new_activity_date = as.Date(ActivityDate, format = "%m/%d/%Y")) %>% 
  mutate(day_of_week = weekdays(new_activity_date)) %>% 
  mutate(People_ID = as.character(Id))

# View(daily_activity) confirms that new_activity_date holds the dates in Date format
# Alternatively, class(daily_activity$new_activity_date) at the console returns "Date"
# Note that the day_of_week and People_ID columns are also created here
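One caveat with `weekdays()` is that it returns plain character strings in the session's language, so grouped tables sort alphabetically (Friday first) rather than in calendar order. The sketch below shows one possible alternative, assuming English day names are wanted; `activity_dates` and `day_names` are hypothetical names used only for illustration.

```r
# Hypothetical sample of date strings in the same format as ActivityDate.
activity_dates <- as.Date(c("4/12/2016", "4/16/2016", "4/17/2016"),
                          format = "%m/%d/%Y")

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which does not depend on the session's language settings.
weekday_number <- as.integer(format(activity_dates, "%u"))

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Ordered factor: tables and plots then sort Monday through Sunday.
day_of_week <- factor(day_names[weekday_number],
                      levels = day_names, ordered = TRUE)

print(day_of_week)
```

With an ordered factor, `group_by(day_of_week)` summaries list the days from Monday to Sunday, which makes weekday and weekend patterns easier to read.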

B.2.2 Hourly Calories Merged DF

library(readr)
library(dplyr)
library(tidyr)
library(stringr)

hourly_calories <- read_csv("hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hourly_calories) #fetching first 6 rows
## # A tibble: 6 × 3
##           Id ActivityHour          Calories
##        <dbl> <chr>                    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366 4/12/2016 1:00:00 AM        61
## 3 1503960366 4/12/2016 2:00:00 AM        59
## 4 1503960366 4/12/2016 3:00:00 AM        47
## 5 1503960366 4/12/2016 4:00:00 AM        48
## 6 1503960366 4/12/2016 5:00:00 AM        48
anyNA(hourly_calories) #checking for missing values
## [1] FALSE
summary(hourly_calories)
##        Id            ActivityHour          Calories     
##  Min.   :1.504e+09   Length:22099       Min.   : 42.00  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 63.00  
##  Median :4.445e+09   Mode  :character   Median : 83.00  
##  Mean   :4.848e+09                      Mean   : 97.39  
##  3rd Qu.:6.962e+09                      3rd Qu.:108.00  
##  Max.   :8.878e+09                      Max.   :948.00

The anyNA() check confirms there are no missing values in the hourly_calories DF. The summary shows that hourly calories burnt range from 42 to 948, with a mean of 97.39.

Next, compute new columns for the hourly_calories DF:

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
hourly_calories <- hourly_calories %>%
  mutate(activity_date = as.Date(ActivityHour, format = "%m/%d/%Y"))%>% 
  mutate(day_of_week = weekdays(activity_date)) %>% 
  mutate(People_ID = as.character(Id))

hourly_calories 
## # A tibble: 22,099 × 6
##            Id ActivityHour          Calories activity_date day_of_week People_ID
##         <dbl> <chr>                    <dbl> <date>        <chr>       <chr>    
##  1 1503960366 4/12/2016 12:00:00 AM       81 2016-04-12    Tuesday     15039603…
##  2 1503960366 4/12/2016 1:00:00 AM        61 2016-04-12    Tuesday     15039603…
##  3 1503960366 4/12/2016 2:00:00 AM        59 2016-04-12    Tuesday     15039603…
##  4 1503960366 4/12/2016 3:00:00 AM        47 2016-04-12    Tuesday     15039603…
##  5 1503960366 4/12/2016 4:00:00 AM        48 2016-04-12    Tuesday     15039603…
##  6 1503960366 4/12/2016 5:00:00 AM        48 2016-04-12    Tuesday     15039603…
##  7 1503960366 4/12/2016 6:00:00 AM        48 2016-04-12    Tuesday     15039603…
##  8 1503960366 4/12/2016 7:00:00 AM        47 2016-04-12    Tuesday     15039603…
##  9 1503960366 4/12/2016 8:00:00 AM        68 2016-04-12    Tuesday     15039603…
## 10 1503960366 4/12/2016 9:00:00 AM       141 2016-04-12    Tuesday     15039603…
## # ℹ 22,089 more rows
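Because `as.Date()` keeps only the date, the hour in ActivityHour is discarded by the step above. If hour-of-day trends were of interest, the full timestamp could be parsed instead. This is a minimal sketch under stated assumptions: the `UTC` time zone is a placeholder (the file does not state one), and the `%p` AM/PM marker is read according to the session locale, so an English or C locale is assumed.

```r
# Hypothetical sample copied from the first rows of ActivityHour shown above.
activity_hour <- c("4/12/2016 12:00:00 AM",
                   "4/12/2016 1:00:00 AM",
                   "4/12/2016 9:00:00 AM")

# Parse date and time together; %I is the 12-hour clock and %p the AM/PM
# marker. The time zone is an assumption: the file does not state one.
activity_time <- as.POSIXct(activity_hour,
                            format = "%m/%d/%Y %I:%M:%S %p",
                            tz = "UTC")

# Hour of day as an integer from 0 to 23.
hour_of_day <- as.integer(format(activity_time, "%H"))

print(hour_of_day)
```

Grouping Calories by `hour_of_day` rather than by day_of_week would then show at what times of day participants are most active.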

B.2.3 Sleep Day Merged DF

library(readr)
library(dplyr)
library(tidyr)
library(stringr)

sleepday <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleepday) 
## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320
anyNA(sleepday) # returns FALSE: no missing values
## [1] FALSE

C. Visualizations and Deductions

C1. Tracking calories burnt per participant

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)

hourly_calories <- hourly_calories %>%
  mutate(activity_date = as.Date(ActivityHour, format = "%m/%d/%Y"))%>% 
  mutate(day_of_week = weekdays(activity_date)) %>% 
  mutate(People_ID = as.character(Id))

hourly_calories %>% 
  ggplot(aes(x = People_ID, y = Calories, color = day_of_week)) +
  geom_col() + coord_flip() + theme_minimal() -> plot

ggplotly(plot)
[Interactive plot: total Calories (horizontal axis) for each People_ID (vertical axis), colored by day_of_week.]

Mean in Calories

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)

hourly_calories %>% 
  group_by(day_of_week) %>% 
  summarise(mean_in_calories = mean(Calories, na.rm = TRUE)) %>% 
  fashion()  
##   day_of_week mean_in_calories
## 1      Friday            97.78
## 2      Monday            97.05
## 3    Saturday            99.87
## 4      Sunday            94.34
## 5    Thursday            97.01
## 6     Tuesday            98.62
## 7   Wednesday            96.87

The means are close across the week, ranging from 94.34 to 99.87 calories per hour, which suggests that participants wore their trackers consistently on every day of the week during the survey period. Sunday has the lowest mean, suggesting that participants rested a little more on Sundays.
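Similar weekday means do not by themselves show that every participant recorded data every day. One way to check this directly is to count the distinct dates on which each participant has at least one record. The following is a minimal base R sketch; the `records` data frame is hypothetical, with Id values borrowed from the plot above and illustrative dates.

```r
# Hypothetical records: two participants, one with a gap in recording.
# The Id values are taken from the plot above; the dates are illustrative.
records <- data.frame(
  People_ID = c("1503960366", "1503960366", "1503960366",
                "1624580081", "1624580081"),
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                            "2016-04-12", "2016-04-14")),
  stringsAsFactors = FALSE
)

# Count the distinct dates on which each participant has at least one record.
days_recorded <- tapply(records$activity_date, records$People_ID,
                        function(d) length(unique(d)))

print(days_recorded)
```

Applied to hourly_calories, a participant whose count falls well below the number of days in the survey period would be one who did not wear the tracker every day.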

C2. Total minutes slept

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)

# Converting Id column to People_ID character format and creating new computed columns, day_of_week and activity_date
sleepday <- sleepday %>%
  mutate(activity_date = as.Date(SleepDay, format = "%m/%d/%Y"))%>% 
  mutate(day_of_week = weekdays(activity_date)) %>% 
  mutate(People_ID = as.character(Id))

sleepday %>% 
  ggplot(aes(x = People_ID, y = TotalMinutesAsleep, color = day_of_week)) +
  geom_col() + coord_flip() + theme_minimal() -> plot

ggplotly(plot)
[Interactive plot: total TotalMinutesAsleep (horizontal axis) for each People_ID (vertical axis), colored by day_of_week.]

Computing the average sleep in minutes

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)

sleepday %>% 
  group_by(day_of_week) %>% 
  summarise(mean_TotalMinutesAsleep = mean(TotalMinutesAsleep, na.rm = TRUE)) %>% 
  fashion() 
##   day_of_week mean_TotalMinutesAsleep
## 1      Friday                  405.42
## 2      Monday                  418.83
## 3    Saturday                  420.81
## 4      Sunday                  452.75
## 5    Thursday                  402.37
## 6     Tuesday                  404.54
## 7   Wednesday                  434.68

Average minutes asleep is highest on Sunday (452.75), which is consistent with participants resting more on Sundays during the survey period.

C3. Daily Activity Plots

C.3.1 Very Active Minutes VS Calories

library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)

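daily_activity %>% 
  ggplot(aes(x = VeryActiveMinutes, y = Calories, color = day_of_week)) +
  geom_jitter() +
  # One least-squares line per panel; a bare geom_abline() only draws y = x
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "grey40") +
  theme_minimal() +
  facet_wrap(~ People_ID) -> plot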

ggplotly(plot)
[Interactive plot: Calories vs. VeryActiveMinutes, one panel per People_ID, colored by day_of_week.]

The plot suggests a positive, roughly linear relationship between VeryActiveMinutes and Calories.

C.3.2 Sedentary Minutes VS Calories

[Interactive plot: Calories vs. SedentaryMinutes, one panel per People_ID, colored by day_of_week.]

The plot suggests that calories burnt decline as sedentary minutes increase for each participant.
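The code for this plot is not shown in this section. The sketch below is one way such a plot could be built, assuming it mirrors the layout of C.3.1; it is not necessarily the code that produced the original figure. The `sample_activity` data frame is a hypothetical stand-in holding the first six rows of participant 1503960366 from the output above.

```r
library(ggplot2)

# Hypothetical stand-in: first six rows of participant 1503960366,
# with values copied from the head() and glimpse() output above.
sample_activity <- data.frame(
  People_ID = "1503960366",
  SedentaryMinutes = c(728, 776, 1218, 726, 773, 539),
  Calories = c(1985, 1797, 1776, 1745, 1863, 1728),
  day_of_week = c("Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
)

# Scatter of calories against sedentary minutes, one panel per participant.
plot_sedentary <- ggplot(sample_activity,
                         aes(x = SedentaryMinutes, y = Calories,
                             color = day_of_week)) +
  geom_jitter() +
  theme_minimal() +
  facet_wrap(~ People_ID)
```

Passing `plot_sedentary` to `ggplotly()` would give the interactive version, and replacing `sample_activity` with the full daily_activity DF would show all participants.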

C.3.3 Total Steps VS Calories

library(ggplot2)
library(dplyr)
library(tidyr)
library(plotly)

daily_activity %>% 
  ggplot(aes(x = TotalSteps, y = Calories, color = People_ID)) +
  geom_point() + geom_smooth() + theme_minimal() +
  facet_wrap(~ day_of_week) -> plot

ggplotly(plot)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
[Interactive plot: Calories vs. TotalSteps with a loess smooth, one panel per day_of_week, colored by People_ID.]

Calories burnt increase as total steps increase.
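The three relationships read off the plots in this section can also be put into numbers with a correlation coefficient. The following is a minimal base R sketch; `sample_activity` is a hypothetical stand-in holding the first seven values of each column from the glimpse() output above, so its results say little about the full dataset.

```r
# Hypothetical stand-in: first seven values of each column for participant
# 1503960366, copied from the glimpse() output above.
sample_activity <- data.frame(
  VeryActiveMinutes = c(25, 21, 30, 29, 36, 38, 42),
  SedentaryMinutes  = c(728, 776, 1218, 726, 773, 539, 1149),
  TotalSteps        = c(13162, 10735, 10460, 9762, 12669, 9705, 13019),
  Calories          = c(1985, 1797, 1776, 1745, 1863, 1728, 1921)
)

predictors <- c("VeryActiveMinutes", "SedentaryMinutes", "TotalSteps")

# Pearson correlation of each predictor with Calories.
correlations <- sapply(predictors, function(column) {
  cor(sample_activity[[column]], sample_activity$Calories)
})

print(round(correlations, 2))
```

Running the same calculation on the full daily_activity DF, or per participant, would show how strong each relationship is and whether the sign matches what the plots suggest.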

D. Conclusions

  1. Participants are least active on Sundays: mean hourly calories burnt are lowest (94.34) and mean sleep is longest (452.75 minutes).
  2. As sedentary minutes increase, calories burnt decrease.
  3. The more active the participants, the more calories they burn.
  4. Participants did not all take part at the same level: some were very active and others much less so.
  5. In-app feedback forms could show which apps customers would use more, and why.

E. Recommendations

Based on the results from the survey’s 30 participants, the FitBit Fitness Tracker is recommended as a benchmark for building or extending Bellabeat’s own fitness apps. Some cues to note:

  * Get more customers to use the apps every day. Friendly notifications and exercise tracking plans would go a long way here.
  * Include reward systems that encourage customers to use all of Bellabeat’s apps.

Thank You. Silas O. Bamidele